That'll Do Fine!: A Coarse Lexical Resource for English-Hindi MT, Using Polylingual Topic Models
Abstract
Parallel corpora are often injected with bilingual lexical resources to improve machine translation (MT) for Indian languages. In the absence of such lexical resources, multilingual topic models have previously been used to create coarse lexical resources using a Cartesian product approach. Our results show that for morphologically rich languages like Hindi, the Cartesian product approach is detrimental to MT. We then present a novel ‘sentential’ approach to using the coarse lexical resource obtained from a multilingual topic model. A system trained on a parallel corpus injected with our coarse lexical resource outperforms a system trained on a parallel corpus and a good-quality lexical resource. Given the quality of our coarse lexical resource and its benefit to MT, we believe that our sentential approach to creating such a resource will help MT for resource-constrained languages.
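The abstract does not spell out how the two approaches build lexicon entries, so the following is only a minimal sketch under stated assumptions. It assumes a polylingual topic model has already produced, for each topic, a list of top English words and a list of top Hindi words. The Cartesian product variant pairs every English word with every Hindi word of the same topic. The "sentential" variant is read here as emitting one pseudo-sentence pair per topic, which is one plausible interpretation and not necessarily the authors' exact procedure. The toy topics and the function names `cartesian_entries` and `sentential_entries` are hypothetical.

```python
from itertools import product

# Hypothetical top words per topic from a polylingual topic model.
# These toy topics are illustrative only and are not taken from the paper.
topics = [
    {"en": ["river", "water", "boat"], "hi": ["नदी", "पानी", "नाव"]},
    {"en": ["school", "teacher"], "hi": ["विद्यालय", "शिक्षक"]},
]


def cartesian_entries(topics):
    """Pair every English top word with every Hindi top word of the same topic."""
    return [(en, hi) for t in topics for en, hi in product(t["en"], t["hi"])]


def sentential_entries(topics):
    """Assumed reading of 'sentential': one pseudo-sentence pair per topic,
    formed by joining each language's top words with spaces."""
    return [(" ".join(t["en"]), " ".join(t["hi"])) for t in topics]


cartesian = cartesian_entries(topics)
sentential = sentential_entries(topics)

# Either list could then be appended to the parallel corpus before training.
print(len(cartesian), "word pairs;", len(sentential), "pseudo-sentence pairs")
```

Under this reading, the Cartesian list grows with the product of the two word-list sizes per topic and contains many incorrect pairs, while the sentential form adds one line per topic and leaves word-level pairing to the aligner.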
Similar Resources
Using Multilingual Topic Models for Improved Alignment in English-Hindi MT
Parallel corpora are often injected with bilingual dictionaries to improve machine translation (MT) for Indian languages. In the absence of such dictionaries, a coarse dictionary may be required. This paper demonstrates the use of a multilingual topic model for creating coarse dictionaries for English-Hindi MT. We compare our approaches with: (a) a baseline with no additional dictionary injection, and...
Developing English-Urdu Machine Translation Via Hindi
The paper presents a strategy for deriving English-to-Urdu translation using an English-to-Hindi MT system. The English-Hindi lexical database is used to collect all possible Hindi words and phrases. These are further augmented by including their morphological variations and attaching all possible postpositions. This list is used to provide a mapping from Hindi to Urdu. There may be change in gender...
Phrase Pair Mappings for Hindi-English Statistical Machine Translation
In this paper, we present our work on the creation of lexical resources for machine translation between English and Hindi. We describe the development of phrase pair mappings for our experiments and a comparative performance evaluation of different trained models on top of the baseline Statistical Machine Translation system. We focused on augmenting the parallel corpus with more voc...
Machine Translation, Language Divergence and Lexical Resources
The key concern in machine translation, whose purpose is to convert documents from one language to another, is the language divergence problem. This problem arises from the fact that languages make different lexical and syntactic choices for expressing an idea. Language divergence needs to be tackled not only for translating between language pairs from distant families (e.g., English and Japa...
NUS-ML: Improving Word Sense Disambiguation Using Topic Features
We participated in the SemEval-1 English coarse-grained all-words task (task 7), English fine-grained all-words task (task 17, subtask 3) and English coarse-grained lexical sample task (task 17, subtask 1). The same method with different labeled data is used for the tasks; SemCor is the labeled corpus used to train our system for the all-words tasks while the labeled corpus that is provided is used ...